In this project, we will explore the financial contributions to 2016 US Presidential candidates of California (CA). We try to unconver the statistics in contribution data, as well as its relationship with the voters’ background and election result of CA usning exploratory data analysis techiniques.
The data set contains about 1M contributions records from 2014 to 2016 of CA, it contains 58 counties in CA, 25 condidates of Democrat, Republican and other parties.
In this section, we audit, clean and manupulate data to preapre for later exploration. Data will be firstly loaded and preprocessed sperately, then several dataframes will be built and used in the following sections.
We will load and take glimps of the data rows. After it the data will be preprocessed by dropping unused part, renaming columns, covnertting data types, and formatting.
## 'data.frame': 1125660 obs. of 19 variables:
## $ V1 : Factor w/ 26 levels "C00458844","C00500587",..: 26 6 6 6 7 7 7 7 6 7 ...
## $ V2 : Factor w/ 26 levels "cand_id","P00003392",..: 1 2 2 2 13 13 13 13 2 13 ...
## $ V3 : Factor w/ 26 levels "Bush, Jeb","cand_nm",..: 2 5 5 5 21 21 21 21 5 21 ...
## $ V4 : Factor w/ 195944 levels " ALERIS, ANNAKIM",..: 33897 6555 26622 59281 99849 101095 101095 101122 78047 101139 ...
## $ V5 : Factor w/ 2118 levels "","-4086",".",..: 399 943 260 607 259 1505 1505 2003 907 2055 ...
## $ V6 : Factor w/ 2 levels "CA","contbr_st": 2 1 1 1 1 1 1 1 1 1 ...
## $ V7 : Factor w/ 134035 levels "","00000","000090272",..: 134033 107295 66913 50011 61864 16081 16081 42612 54660 57711 ...
## $ V8 : Factor w/ 57984 levels ""," APPLE INC.",..: 12079 NA NA NA 3977 54916 54916 36582 NA 35437 ...
## $ V9 : Factor w/ 25387 levels ""," ATTORNEY",..: 4972 18993 18993 18993 21225 15921 15921 17407 18993 14744 ...
## $ V10: Factor w/ 9578 levels "-.01","-.24",..: 9578 7397 4167 7348 6763 6333 1478 5112 6763 1431 ...
## $ V11: Factor w/ 660 levels "01-APR-15","01-APR-16",..: 660 540 408 25 78 99 121 78 408 99 ...
## $ V12: Factor w/ 75 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 56 1 1 1 1 1 1 1 1 1 ...
## $ V13: Factor w/ 3 levels "","memo_cd","X": 2 3 3 3 1 1 1 1 3 1 ...
## $ V14: Factor w/ 424 levels "","*","* $550 REFUNDED 6/16/16",..: 273 39 39 39 4 4 4 4 39 4 ...
## $ V15: Factor w/ 4 levels "form_tp","SA17A",..: 1 3 3 3 2 2 2 2 3 2 ...
## $ V16: Factor w/ 187 levels "1003942","1004025",..: 187 141 141 141 106 106 106 106 141 106 ...
## $ V17: Factor w/ 1122206 levels "A000771210424405B8CF",..: 714265 327338 326620 324002 858678 860122 862423 858140 326658 860118 ...
## $ V18: Factor w/ 5 levels "","election_tp",..: 2 4 4 4 4 4 4 4 4 4 ...
## $ V19: logi NA NA NA NA NA NA ...
candidate name, contribution city, zip, amount, date, contributor's employer and occupationdate from string to dateamount and zip to numberscity, candidate, employer, occupationcheck for invalid values such as N/A
load and preprocess data
check state equals to CA
NA in each columnsemployer we only care the top ranks, minor numbers of NA will be ignoredfor occupation we will decide if further cleaning is necessary after more exploration
## candidate city zip employer occupation amount
## 0 1 102 0 0 0
## date
## 0
We will use the family income data to gain more background of the contributors and voters in CA. The data comes from US 2014 census. The preprocessing follow similar steps as contribution data.
## 'data.frame': 60 obs. of 9 variables:
## $ GEO.id : Factor w/ 2 levels "0400000US06",..: 2 1 1 1 1 1 1 1 1 1 ...
## $ GEO.id2 : Factor w/ 2 levels "06","Id2": 2 1 1 1 1 1 1 1 1 1 ...
## $ GEO.display.label : Factor w/ 2 levels "California","Geography": 2 1 1 1 1 1 1 1 1 1 ...
## $ GCT_STUB.target.geo.id : Factor w/ 60 levels "0400000US06",..: 60 1 2 3 4 5 6 7 8 9 ...
## $ GCT_STUB.target.geo.id2 : Factor w/ 60 levels "06","06001","06003",..: 60 1 2 3 4 5 6 7 8 9 ...
## $ GCT_STUB.display.label : Factor w/ 60 levels "California","California - Alameda County",..: 60 1 2 3 4 5 6 7 8 9 ...
## $ GCT_STUB.display.label.1: Factor w/ 60 levels "Alameda County",..: 12 6 1 2 3 4 5 7 8 9 ...
## $ HC01 : Factor w/ 60 levels "106401","108088",..: 60 41 57 43 38 24 37 23 59 17 ...
## $ HC02 : Factor w/ 60 levels "1029","1030",..: 60 28 57 5 46 17 44 39 4 42 ...
county name and family_income## county family_income
## 3 Alameda 90822
## 4 Alpine 71932
## 5 Amador 68765
## 6 Butte 56934
## 7 Calaveras 67100
## 8 Colusa 56472
## 'data.frame': 58 obs. of 4 variables:
## $ Vote.by.county: Factor w/ 58 levels "Alameda","Alpine",..: 19 30 37 33 36 43 1 34 7 38 ...
## $ Clinton : Factor w/ 58 levels "1,601,382","1,614",..: 1 35 34 24 22 31 29 13 20 23 ...
## $ Trump : Factor w/ 58 levels "1,030","1,989",..: 46 31 24 22 15 56 49 8 51 23 ...
## $ Rpt. : Factor w/ 1 level "100%": 1 1 1 1 1 1 1 1 1 1 ...
county name and votes for Clinton and votes for Trump## county Clinton Trump
## 1 Los Angeles 1601382 525308
## 2 Orange 395801 356892
## 3 San Diego 390531 269422
## 4 Riverside 228731 219359
## 5 San Bernardino 220853 188080
## 6 Santa Clara 312994 91990
## 'data.frame': 25 obs. of 3 variables:
## $ candidate: Factor w/ 25 levels "Bush","Carson",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ full.name: Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ party : Factor w/ 3 levels "Democrat","Other",..: 3 3 3 1 3 3 3 3 3 3 ...
## candidate full.name party
## 1 Bush Bush, Jeb Republican
## 2 Carson Carson, Benjamin S. Republican
## 3 Christie Christie, Christopher J. Republican
## 4 Clinton Clinton, Hillary Rodham Democrat
## 5 Cruz Cruz, Rafael Edward 'Ted' Republican
## 6 Fiorina Fiorina, Carly Republican
## zip city state latitude longitude fips
## 1 00210 Portsmouth NH 43.0059 -71.0132 33015
## 2 00211 Portsmouth NH 43.0059 -71.0132 33015
## 3 00212 Portsmouth NH 43.0059 -71.0132 33015
## 4 00213 Portsmouth NH 43.0059 -71.0132 33015
## 5 00214 Portsmouth NH 43.0059 -71.0132 33015
## 6 00215 Portsmouth NH 43.0059 -71.0132 33015
## county_name state state_fips county_fips fips_class CSA CBSA
## 1 Autauga County AL 01 001 H1 <NA> 33860
## 2 Baldwin County AL 01 003 H1 380 19300
## 3 Barbour County AL 01 005 H1 <NA> <NA>
## 4 Bibb County AL 01 007 H1 142 13820
## 5 Blount County AL 01 009 H1 142 13820
## 6 Bullock County AL 01 011 H1 <NA> <NA>
## population
## 1 54571
## 2 182265
## 3 27457
## 4 22915
## 5 57322
## 6 10914
fips## zip city state latitude longitude fips
## 39141 89439 Verdi CA 39.52369 -120.0301 6091
## 39206 90001 Los Angeles CA 33.97291 -118.2488 6037
## 39207 90002 Los Angeles CA 33.94832 -118.2485 6037
## 39208 90003 Los Angeles CA 33.96271 -118.2760 6037
## 39209 90004 Los Angeles CA 34.07711 -118.3076 6037
## 39210 90005 Los Angeles CA 34.05891 -118.3085 6037
income, counties to votes, as they all belong to information of each county## county Clinton Trump family_income
## Length:58 Min. : 318 Min. : 211 Min. : 45296
## Class :character 1st Qu.: 5149 1st Qu.: 5643 1st Qu.: 52789
## Mode :character Median : 21155 Median : 14604 Median : 62945
## Mean : 94515 Mean : 51133 Mean : 66744
## 3rd Qu.: 73545 3rd Qu.: 49241 3rd Qu.: 75677
## Max. :1601382 Max. :525308 Max. :120030
## population
## Min. : 1175
## 1st Qu.: 48001
## Median : 179140
## Mean : 642310
## 3rd Qu.: 642593
## Max. :9818605
## county Clinton Trump family_income population
## 1 Alameda 300185 59384 90822 1510271
## 2 Alpine 318 211 71932 1175
## 3 Amador 3420 6069 68765 38091
## 4 Butte 31638 35354 56934 220000
## 5 Calaveras 6225 10126 67100 45578
## 6 Colusa 1832 2709 56472 21419
zip_codescontributions and votes or counties## zip city county
## 1 94604 Oakland Alameda
## 2 94551 Livermore Alameda
## 3 94542 Hayward Alameda
## 4 94621 Oakland Alameda
## 5 94557 Hayward Alameda
## 6 94623 Oakland Alameda
zip_codes and contributionscounty_by_zip and county_by_citycounty_by_zip is NA, then we believe it’s valid and will fix city by looking up the right zip in zip_codescounty_by_city is valid, similar to type 1NA, then we lack the information about the truth, we will keep the rows except for statistics when county is involvedNA, it should be dropped as invalid data## candidate city zip
## 5 Sanders Camarillo 93011
## 39 Clinton Palo Alto 94301
## 105 Cruz Kagel Canyon 91342
## 194 Clinton Playa Vista 90094
## 201 Sanders Goleta 93116
## 255 Sanders Palo Alto 94301
## 310 Sanders Palo Alto 94303
## 338 Clinton Montecito 93108
## 355 Sanders Palo Alto 94303
## 360 Clinton Rolling Hills Estates 90274
## employer occupation amount
## 5 At&T Government Solutions Software Engineer 40
## 39 NA Retired 500
## 105 City Of Huntington Beach Civil Servant 35
## 194 NA Retired 300
## 201 Impulse Network Engineer 100
## 255 Ca Dept Of Justice Attorney 27
## 310 Palo Alto Medical Foundation Radiologic Technologist 15
## 338 Manageamerica Integrated Billing Servi Business Owner 250
## 355 Self Employed Self Employed 50
## 360 Retired Former Banker 100
## date county_by_zip county_by_city
## 5 2016-03-04 <NA> Ventura
## 39 2016-04-15 Santa Clara San Mateo
## 105 2016-04-05 Los Angeles <NA>
## 194 2016-04-04 Los Angeles <NA>
## 201 2016-03-06 <NA> Santa Barbara
## 255 2016-03-04 Santa Clara San Mateo
## 310 2016-03-06 Santa Clara San Mateo
## 338 2016-04-07 Santa Barbara <NA>
## 355 2016-03-06 Santa Clara San Mateo
## 360 2016-04-04 Los Angeles <NA>
## [1] 44016
## [1] 13999
## [1] 247
fix the first two types of error
then final number of invalid county entries
## [1] 14246
## candidate city zip employer
## 39 Clinton Palo Alto 94301 NA
## 255 Sanders Palo Alto 94301 Ca Dept Of Justice
## 310 Sanders Palo Alto 94303 Palo Alto Medical Foundation
## 355 Sanders Palo Alto 94303 Self Employed
## 453 Clinton Westlake Village 91361 Retired
## 602 Clinton Tracy 95377 Mental Health Counselor
## occupation amount date county_by_zip county_by_city
## 39 Retired 500 2016-04-15 Santa Clara San Mateo
## 255 Attorney 27 2016-03-04 Santa Clara San Mateo
## 310 Radiologic Technologist 15 2016-03-06 Santa Clara San Mateo
## 355 Self Employed 50 2016-03-06 Santa Clara San Mateo
## 453 Teacher 10 2016-04-24 Ventura Los Angeles
## 602 Retired 10 2016-04-29 San Joaquin Alameda
## county
## 39 <NA>
## 255 <NA>
## 310 <NA>
## 355 <NA>
## 453 <NA>
## 602 <NA>
city, zip can be droppedThis is convenient since dataset is not very large, otherwise, it should be stored in several tables as SQL database.
add party of contributions
contributions is the main dataframe for exploration## candidate employer occupation amount date
## 1 Clinton NA Retired 50 2016-04-26
## 2 Clinton NA Retired 200 2016-04-20
## 3 Clinton NA Retired 5 2016-04-02
## 4 Sanders At&T Government Solutions Software Engineer 40 2016-03-04
## 5 Sanders Vericor Enterprises Inc. Pharmacist 35 2016-03-05
## 6 Sanders Vericor Enterprises Inc. Pharmacist 100 2016-03-06
## county party
## 1 Marin Democrat
## 2 San Luis Obispo Democrat
## 3 San Bernardino Democrat
## 4 Ventura Democrat
## 5 Los Angeles Democrat
## 6 Los Angeles Democrat
## candidate employer occupation
## Length:1125659 Length:1125659 Length:1125659
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## amount date county
## Min. :-10500.0 Min. :2013-11-05 Length:1125659
## 1st Qu.: 15.0 1st Qu.:2016-02-29 Class :character
## Median : 27.0 Median :2016-05-03 Mode :character
## Mean : 121.8 Mean :2016-05-03
## 3rd Qu.: 97.0 3rd Qu.:2016-08-01
## Max. : 10800.0 Max. :2016-10-19
## party
## Length:1125659
## Class :character
## Mode :character
##
##
##
## candidate employer occupation amount date county
## 0 0 0 0 0 14246
## party
## 0
votes contains the background information and election results by county## county Clinton Trump family_income population
## 1 Alameda 300185 59384 90822 1510271
## 2 Alpine 318 211 71932 1175
## 3 Amador 3420 6069 68765 38091
## 4 Butte 31638 35354 56934 220000
## 5 Calaveras 6225 10126 67100 45578
## 6 Colusa 1832 2709 56472 21419
In this part, we will use the univariate plots to show basic statistics of the data.
$100, with some peaks at $250 and $500 group and summarize contribution by county, then merge with votes
counties with top counts of contribution are Los Angeles, San Diego, Alameda, San Francisco and Orange.
they are also among the counties with top 10 population
group and summarize contribution by candidate, then merge with candidates
Republicians received most contributions in CA, more than four times larger than the total of others.
employer data.numbers of unique employer and NA
## [1] 57977
## [1] 19951
But we can still see the patterns that most counts of contribution come from the unemployed, self-employed, retired, self-employed, unversity stuff or student, as well as software and Internet industry.
The situation of contribution count vs occupation is similar to employer, with the variety decreases nealy by a half.
unique numbers of occupation type and NA
## [1] 25386
## [1] 7239
contributions contains all contribution related information.votes table contains background information such as income, population by county. It also has the votes result of presendential election of CA by county.contributions and dataframes grouped and summarized based on it, for example contributions.by_candidate, contributions.by_couty, and contributions.by_date.In this part, we further analyze the datasets, focusing on the following questions.
As studied in previous plots part, there are mainly two patterns.
In this section, we will focus on study of final candidates Hillary and Trump, or the Democrat and Republican.
homemaker, retired, enginner, sales, ceo, and self-employed.homemaker, retired, enginner, sales, ceo, and self-employed.contributions, votes, and so on.